<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
Code to maintain and access indices.
<!-- TODO: add IndexWriter, IndexWriterConfig, DocValues, etc etc -->
<h2>Table Of Contents</h2>
<p>
    <ol>
        <li><a href="#postings">Postings APIs</a>
            <ul>
                <li><a href="#fields">Fields</a></li>
                <li><a href="#terms">Terms</a></li>
                <li><a href="#documents">Documents</a></li>
                <li><a href="#positions">Positions</a></li>
            </ul>
        </li>
        <li><a href="#stats">Index Statistics</a>
            <ul>
                <li><a href="#termstats">Term-level</a></li>
                <li><a href="#fieldstats">Field-level</a></li>
                <li><a href="#segmentstats">Segment-level</a></li>
                <li><a href="#documentstats">Document-level</a></li>
            </ul>
        </li>
    </ol>
</p>
<a name="postings"></a>
<h2>Postings APIs</h2>
<a name="fields"></a>
<h4>
    Fields
</h4>
<p>
{@link org.apache.lucene.index.Fields} is the initial entry point into the 
postings APIs. It can be obtained in several ways:
<pre class="prettyprint">
// access indexed fields for an index segment
Fields fields = reader.fields();
// access term vector fields for a specified document
Fields termVectorFields = reader.getTermVectors(docid);
</pre>
Fields implements Java's Iterable interface, so it's easy to enumerate the
list of fields:
<pre class="prettyprint">
// enumerate list of fields
for (String field : fields) {
  // access the terms for this field
  Terms terms = fields.terms(field);
}
</pre>
</p>
<a name="terms"></a>
<h4>
    Terms
</h4>
<p>
{@link org.apache.lucene.index.Terms} represents the collection of terms
within a field. It exposes some metadata and <a href="#fieldstats">statistics</a>,
as well as an API for enumeration.
<pre class="prettyprint">
// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());
// iterate through terms
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
while ((term = termsEnum.next()) != null) {
  // next() returns the same term that termsEnum.term() would report
  doSomethingWith(term);
}
</pre>
{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list
of terms within a field, some <a href="#termstats">statistics</a> about the current term,
and methods to access the term's <a href="#documents">documents</a> and
<a href="#positions">positions</a>.
<pre class="prettyprint">
// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"));
if (found) {
  // get the document frequency
  System.out.println(termsEnum.docFreq());
  // enumerate through documents
  DocsEnum docs = termsEnum.docs(null, null);
  // enumerate through documents and positions
  DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
}
</pre>
</p>
<a name="documents"></a>
<h4>
    Documents
</h4>
<p>
{@link org.apache.lucene.index.DocsEnum} is an extension of 
{@link org.apache.lucene.search.DocIdSetIterator} that iterates over the list of
documents for a term, along with the term frequency within each document.
<pre class="prettyprint">
int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  System.out.println(docsEnum.freq());
}
</pre>
</p>
<a name="positions"></a>
<h4>
    Positions
</h4>
<p>
{@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of 
{@link org.apache.lucene.index.DocsEnum} that additionally allows iteration
over the positions at which a term occurred within the document, and over any
additional per-position information (offsets and payload).
<pre class="prettyprint">
int docid;
while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  int freq = docsAndPositionsEnum.freq();
  for (int i = 0; i < freq; i++) {
     System.out.println(docsAndPositionsEnum.nextPosition());
     System.out.println(docsAndPositionsEnum.startOffset());
     System.out.println(docsAndPositionsEnum.endOffset());
     System.out.println(docsAndPositionsEnum.getPayload());
  }
}
</pre>
</p>
<a name="stats"></a>
<h2>Index Statistics</h2>
<a name="termstats"></a>
<h4>
    Term statistics
</h4>
<p>
    <ul>
       <li>{@link org.apache.lucene.index.TermsEnum#docFreq}: Returns the number of 
           documents that contain at least one occurrence of the term. This statistic 
           is always available for an indexed term. Note that it also counts 
           deleted documents; when segments are merged, the statistic is updated as 
           those deleted documents are merged away.
       <li>{@link org.apache.lucene.index.TermsEnum#totalTermFreq}: Returns the number 
           of occurrences of this term across all documents. Note that this statistic 
           is unavailable (returns <code>-1</code>) if term frequencies were omitted 
           from the index 
           ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) 
           for the field. Like docFreq(), it will also count occurrences that appear in 
           deleted documents.
    </ul>
</p>
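<p>
The effect of deleted documents on these two statistics can be sketched with a toy
model. The class <code>ToyTermStats</code> below is hypothetical and is not part of Lucene:
it represents one term's postings as a map from docid to term frequency, and its
<code>merge</code> method only imitates the effect described above (a real merge also
renumbers documents, which this sketch ignores).
</p>

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model, not a Lucene class: one term's postings as docid -> term frequency.
final class ToyTermStats {

  // Counts every document in the postings, whether or not it is still live.
  static int docFreq(Map<Integer, Integer> postings) {
    return postings.size();
  }

  // Sums the term frequency over every document in the postings.
  static long totalTermFreq(Map<Integer, Integer> postings) {
    long total = 0;
    for (int freq : postings.values()) {
      total += freq;
    }
    return total;
  }

  // Imitates a merge: postings that belong to deleted documents are dropped.
  static Map<Integer, Integer> merge(Map<Integer, Integer> postings, BitSet liveDocs) {
    Map<Integer, Integer> merged = new TreeMap<>();
    for (Map.Entry<Integer, Integer> entry : postings.entrySet()) {
      if (liveDocs.get(entry.getKey())) {
        merged.put(entry.getKey(), entry.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<Integer, Integer> postings = new TreeMap<>();
    postings.put(0, 3); // three occurrences in document 0
    postings.put(4, 1); // one occurrence in document 4
    postings.put(7, 2); // two occurrences in document 7

    BitSet liveDocs = new BitSet();
    liveDocs.set(0, 8);  // documents 0..7 start out live
    liveDocs.clear(4);   // document 4 is deleted, but its postings remain

    System.out.println("before merge: docFreq=" + docFreq(postings)
        + " totalTermFreq=" + totalTermFreq(postings));
    Map<Integer, Integer> merged = merge(postings, liveDocs);
    System.out.println("after merge:  docFreq=" + docFreq(merged)
        + " totalTermFreq=" + totalTermFreq(merged));
  }
}
```

<p>
In this sketch the deletion alone changes neither value; only the simulated merge
lowers them.
</p>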
<a name="fieldstats"></a>
<h4>
    Field statistics
</h4>
<p>
    <ul>
       <li>{@link org.apache.lucene.index.Terms#size}: Returns the number of 
           unique terms in the field. This statistic may be unavailable 
           (returns <code>-1</code>) for some Terms implementations such as
           {@link org.apache.lucene.index.MultiTerms}, where it cannot be efficiently
           computed.  Note that this count also includes terms that appear only
           in deleted documents: when segments are merged such terms are also merged
           away and the statistic is then updated.
       <li>{@link org.apache.lucene.index.Terms#getDocCount}: Returns the number of
           documents that contain at least one occurrence of any term for this field.
           This can be thought of as a field-level docFreq(). Like docFreq(), it will
           also count deleted documents.
       <li>{@link org.apache.lucene.index.Terms#getSumDocFreq}: Returns the number of
           postings (term-document mappings in the inverted index) for the field. This
           can be thought of as the sum of {@link org.apache.lucene.index.TermsEnum#docFreq}
           across all terms in the field, and like docFreq() it will also count postings
           that appear in deleted documents.
       <li>{@link org.apache.lucene.index.Terms#getSumTotalTermFreq}: Returns the number
           of tokens for the field. This can be thought of as the sum of 
           {@link org.apache.lucene.index.TermsEnum#totalTermFreq} across all terms in the
           field, and like totalTermFreq() it will also count occurrences that appear in
           deleted documents, and will be unavailable (returns <code>-1</code>) if term 
           frequencies were omitted from the index 
           ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) 
           for the field.
    </ul>
</p>
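<p>
As a rough illustration of how the four field-level statistics relate to one another,
the following sketch computes them over a small in-memory structure. The class
<code>ToyFieldStats</code> and its methods are hypothetical and are not part of Lucene;
a real index exposes these values through {@link org.apache.lucene.index.Terms}.
</p>

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model, not a Lucene class:
// one field's postings as term -> (docid -> term frequency).
final class ToyFieldStats {
  private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

  void add(String term, int docid, int freq) {
    postings.computeIfAbsent(term, t -> new TreeMap<>()).put(docid, freq);
  }

  // Number of unique terms in the field.
  long size() {
    return postings.size();
  }

  // Number of distinct documents that contain any term of the field.
  int docCount() {
    Set<Integer> docs = new HashSet<>();
    for (Map<Integer, Integer> termPostings : postings.values()) {
      docs.addAll(termPostings.keySet());
    }
    return docs.size();
  }

  // Number of term-document pairs: the sum of each term's document frequency.
  long sumDocFreq() {
    long sum = 0;
    for (Map<Integer, Integer> termPostings : postings.values()) {
      sum += termPostings.size();
    }
    return sum;
  }

  // Number of tokens: the sum of each term's total term frequency.
  long sumTotalTermFreq() {
    long sum = 0;
    for (Map<Integer, Integer> termPostings : postings.values()) {
      for (int freq : termPostings.values()) {
        sum += freq;
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    ToyFieldStats field = new ToyFieldStats();
    field.add("apache", 0, 1);
    field.add("apache", 1, 2);
    field.add("lucene", 0, 3);
    field.add("index", 2, 1);
    System.out.println("size=" + field.size());
    System.out.println("docCount=" + field.docCount());
    System.out.println("sumDocFreq=" + field.sumDocFreq());
    System.out.println("sumTotalTermFreq=" + field.sumTotalTermFreq());
  }
}
```

<p>
With the sample data in <code>main</code>, three terms spread over three documents give
four postings and seven tokens.
</p>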
<a name="segmentstats"></a>
<h4>
    Segment statistics
</h4>
<p>
    <ul>
       <li>{@link org.apache.lucene.index.IndexReader#maxDoc}: Returns the number of 
           documents (including deleted documents) in the index. 
       <li>{@link org.apache.lucene.index.IndexReader#numDocs}: Returns the number 
           of live documents (excluding deleted documents) in the index.
       <li>{@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the
           number of deleted documents in the index.
       <li>{@link org.apache.lucene.index.Fields#size}: Returns the number of indexed
           fields.
       <li>{@link org.apache.lucene.index.Fields#getUniqueTermCount}: Returns the number 
           of indexed terms, that is, the sum of {@link org.apache.lucene.index.Terms#size}
           across all fields.
    </ul>
</p>
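<p>
The three document counts above are tied together by a simple identity: the number of
deleted documents is <code>maxDoc</code> minus <code>numDocs</code>. A minimal sketch,
assuming a hypothetical <code>ToySegmentStats</code> class (not part of Lucene) that
tracks live documents in a bit set:
</p>

```java
import java.util.BitSet;

// Hypothetical toy model, not a Lucene class: a segment with a live-documents bit set.
final class ToySegmentStats {
  private final int maxDoc;
  private final BitSet liveDocs;

  ToySegmentStats(int maxDoc) {
    this.maxDoc = maxDoc;
    this.liveDocs = new BitSet(maxDoc);
    this.liveDocs.set(0, maxDoc); // every document starts out live
  }

  // Marks a document as deleted; it still occupies a docid.
  void delete(int docid) {
    liveDocs.clear(docid);
  }

  // All documents, including deleted ones.
  int maxDoc() {
    return maxDoc;
  }

  // Live documents only.
  int numDocs() {
    return liveDocs.cardinality();
  }

  // Deleted documents: whatever is not live.
  int numDeletedDocs() {
    return maxDoc - numDocs();
  }

  public static void main(String[] args) {
    ToySegmentStats segment = new ToySegmentStats(10);
    segment.delete(2);
    segment.delete(5);
    System.out.println("maxDoc=" + segment.maxDoc());
    System.out.println("numDocs=" + segment.numDocs());
    System.out.println("numDeletedDocs=" + segment.numDeletedDocs());
  }
}
```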
<a name="documentstats"></a>
<h4>
    Document statistics
</h4>
<p>
Document statistics are available during the indexing process for an indexed field. Typically,
a {@link org.apache.lucene.search.similarities.Similarity} implementation will store some
of these values (possibly in a lossy way) in the normalization value for the document in
its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.
</p>
<p>
    <ul>
       <li>{@link org.apache.lucene.index.FieldInvertState#getLength}: Returns the number of 
           tokens for this field in the document. Note that this is just the number
           of times that {@link org.apache.lucene.analysis.TokenStream#incrementToken} returned
           true, and is unrelated to the values in 
           {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}.
       <li>{@link org.apache.lucene.index.FieldInvertState#getNumOverlap}: Returns the number
           of tokens for this field in the document that had a position increment of zero. This
           can be used to compute a document length that discounts artificial tokens
           such as synonyms.
       <li>{@link org.apache.lucene.index.FieldInvertState#getPosition}: Returns the accumulated
           position value for this field in the document: computed from the values of
           {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} and including
           {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap}s across multivalued
           fields.
       <li>{@link org.apache.lucene.index.FieldInvertState#getOffset}: Returns the total
           character offset value for this field in the document: computed from the values of
           {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} returned by 
           {@link org.apache.lucene.analysis.TokenStream#end}, and including
           {@link org.apache.lucene.analysis.Analyzer#getOffsetGap}s across multivalued
           fields.
       <li>{@link org.apache.lucene.index.FieldInvertState#getUniqueTermCount}: Returns the number
           of unique terms encountered for this field in the document.
       <li>{@link org.apache.lucene.index.FieldInvertState#getMaxTermFrequency}: Returns the maximum
           frequency across all unique terms encountered for this field in the document. 
    </ul>
</p>
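<p>
To make the difference between these counters concrete, the sketch below accumulates
them from a stream of (term, position increment) pairs. The class
<code>ToyInvertState</code> is hypothetical and is not part of Lucene. Its position is a
plain running sum of increments starting at zero, which is a simplification: Lucene's own
bookkeeping differs in detail, and gaps between multiple values of a field are not modeled here.
</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model, not a Lucene class: per-document, per-field counters.
final class ToyInvertState {
  private final Map<String, Integer> termFreqs = new HashMap<>();
  private int length;            // number of tokens seen
  private int numOverlap;        // tokens whose position increment was zero
  private int position;          // running sum of position increments (simplified)
  private int maxTermFrequency;  // highest frequency of any single term

  void addToken(String term, int positionIncrement) {
    length++;
    if (positionIncrement == 0) {
      numOverlap++;
    }
    position += positionIncrement;
    int freq = termFreqs.merge(term, 1, Integer::sum);
    maxTermFrequency = Math.max(maxTermFrequency, freq);
  }

  int getLength() { return length; }
  int getNumOverlap() { return numOverlap; }
  int getPosition() { return position; }
  int getUniqueTermCount() { return termFreqs.size(); }
  int getMaxTermFrequency() { return maxTermFrequency; }

  // A document length that discounts stacked tokens such as synonyms.
  int discountedLength() {
    return length - numOverlap;
  }

  public static void main(String[] args) {
    ToyInvertState state = new ToyInvertState();
    state.addToken("quick", 1);
    state.addToken("fast", 0);  // synonym stacked on "quick"
    state.addToken("brown", 1);
    state.addToken("fox", 1);
    state.addToken("fox", 1);
    System.out.println("length=" + state.getLength());
    System.out.println("numOverlap=" + state.getNumOverlap());
    System.out.println("position=" + state.getPosition());
    System.out.println("uniqueTermCount=" + state.getUniqueTermCount());
    System.out.println("maxTermFrequency=" + state.getMaxTermFrequency());
    System.out.println("discountedLength=" + state.discountedLength());
  }
}
```

<p>
Here the length counts all five tokens, while the discounted length ignores the stacked
synonym, which is the kind of adjustment a norm computation might choose to make.
</p>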
<p>
Additional user-supplied statistics can be added to the document as DocValues fields and
accessed via {@link org.apache.lucene.index.AtomicReader#getNumericDocValues}.
</p>
</body>
</html>
